Traditional feature selection is conducted based on feature importance. With the development of machine learning interpretability, multiple methods have emerged to interpret the 'black box' of machine learning models; one of the most widely used is SHAP. While working on this project, I wondered whether SHAP could also be used for feature selection to improve the model.
Some research supports this idea, for example https://iopscience.iop.org/article/10.1088/1742-6596/1284/1/012026/pdf. In that paper, the authors show that a new feature selection method based on SHAP values is superior to widely used methods.
This notebook tests whether feature selection based on SHAP values works better for this specific project.
The objective of this project is to develop a customer resiliency score that ranks customers' resilience to an unexpected disaster or an economic downturn. The scoring model will be applied to all consumer portfolios. It predicts the likelihood of a customer's FICO score dropping by 40 points or more in the next six months.
import pandas as pd
from importlib import reload
import numpy as np
pd.set_option('display.max_columns', None) # or 1000
pd.set_option('display.max_rows', None) # or 1000
pd.set_option('display.max_colwidth', None) # -1 is deprecated; None shows the full column width
pd.set_option('float_format', '{:f}'.format)
from IPython.display import display, HTML
import warnings
warnings.filterwarnings("ignore")
df1=pd.read_parquet('../../Data/df_combined.parquet')
dfOOt=pd.read_parquet('../../Data/df_combined_oot_1snap.parquet')
df1=df1.loc[(df1['target']>=0)]
catCols=['in_cma_cus_state','in_cma_score_type_tru','in_cma_score_type','in_cma_fraud_alert_ind','in_cma_addr_disc_ind'
,'in_cma_vantage_rsn_5','in_cma_bni_rsn_5','in_cma_system','Income_Confidence'
,'in_cma_reject_ind','Income_Confidence1','Income_Confidence2'
,'Income_Confidence3']
excludeCols=['in_cma_account','in_cma_score_date','in_cma_eq_seqnum','income_date_x','income_date_y'
,'in_cma_cus_ssn','in_cma_cus_zipcode'
,'in_cma_cus_name_1','in_cma_cus_name_2'
,'in_cma_project_id','in_cma_cus_addr_1','in_cma_cus_addr_2'
,'in_cma_cus_city','dob','BATCH','in_tsys_fill','in_acls_bank','in_acls_loan'
,'in_acls_appl','in_tsys_acct','in_cma_cus_account_num'
,'in_cma_orig_acct_num','reported_inc_hit','ods_business_dt','diff_score','lag_score'] #dfBase_2lag
dfTotal=df1.drop(columns=excludeCols) # the positional axis argument is deprecated
print(dfTotal.shape)
dfTotal=dfTotal.drop_duplicates(keep="first")
print(dfTotal.shape)
# One-hot encode is used on x
dfTotal = pd.get_dummies(dfTotal, columns=catCols)
dfTotal = dfTotal.astype(float)
# xgboost can handle NaN values natively, but in general we need to handle NaN for other classification models.
dfTotal=dfTotal.fillna(-9999)
dfTotal.target.value_counts()
# Remove Collinear Variables
threshold = 0.8
corr_matrix = dfTotal.corr().abs()
print(dfTotal.shape)
# keep only the upper triangle so each correlated pair is considered once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)) # np.bool is removed in recent NumPy
to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
dfTotal = dfTotal.drop(columns = to_drop)
print(dfTotal.shape)
from sklearn.model_selection import train_test_split
x, y=dfTotal.drop(columns='target'), dfTotal['target']
train, test = train_test_split(dfTotal, test_size=0.25, random_state=42) # fixed seed so the method comparison is reproducible
x_train=train.drop(columns='target')
y_train=train['target']
x_test=test.drop(columns='target')
y_test=test['target']
from sklearn.metrics import roc_curve
def ks_cal(y_test,y_pred):
    # Kolmogorov-Smirnov statistic: maximum separation between TPR and FPR
    fpr,tpr,thresholds=roc_curve(y_test,y_pred)
    ks=max(tpr-fpr)
    return ks
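The ks_cal helper above is not called later in the notebook; here is a minimal usage sketch on toy labels and scores (hypothetical values, not project data):
y_true_demo = np.array([0, 0, 1, 1, 0, 1])
y_score_demo = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70])
print('KS =', ks_cal(y_true_demo, y_score_demo))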
In this project, xgboost turned out to be the best model. Here, we use xgboost with default parameters to compare the feature selection methodologies.
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score
model = XGBClassifier()
model.fit(x_train, y_train)
print('baseline (all features), train roc_auc_score: '+ str(roc_auc_score(y_train, model.predict_proba(x_train).T[1])))
print('baseline (all features), test roc_auc_score: '+ str(roc_auc_score(y_test, model.predict_proba(x_test).T[1])))
We set a baseline by using all variables with no feature selection applied; the test ROC AUC score is 0.72.
for imtype in ['gain','weight','cover','total_gain','total_cover'] :
    model = XGBClassifier()
    model.fit(x_train, y_train)
    feature_important = model.get_booster().get_score(importance_type=imtype)
    keys = list(feature_important.keys())
    values = list(feature_important.values())
    dfImportances = pd.DataFrame(data=values, index=keys, columns=["importance"]).sort_values(by = "importance", ascending=False)
    # select the top features by the current importance type
    top=30
    # dfImportances.head(top).plot(kind='barh')
    x_train2= x_train[dfImportances.head(top).index.tolist()]
    x_test2= x_test[dfImportances.head(top).index.tolist()]
    model.fit(x_train2, y_train)
    print('importance type: '+str(imtype)+' , train roc_auc_score: '+ str(roc_auc_score(y_train, model.predict_proba(x_train2).T[1])))
    print('importance type: '+str(imtype)+' , test roc_auc_score: '+ str(roc_auc_score(y_test, model.predict_proba(x_test2).T[1])))
According to the xgboost documentation (https://xgboost.readthedocs.io/en/latest/python/python_api.html), the booster exposes five feature importance types: gain, weight, cover, total_gain and total_cover. You can find the detailed information in the documentation. Here, we loop through each type. The results show that, for our data, total_gain and total_cover perform better than the others; total_gain achieves the highest test ROC AUC score of 0.702.
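As a quick sanity check on these definitions (a minimal sketch, assuming model is the classifier fitted in the loop above): gain is the average gain per split and weight is the number of splits a feature appears in, so total_gain should equal their product for every feature.
booster = model.get_booster()
gain = booster.get_score(importance_type='gain')
weight = booster.get_score(importance_type='weight')
total_gain = booster.get_score(importance_type='total_gain')
for f in list(total_gain)[:5]:
    # relative tolerance to absorb floating-point rounding
    assert abs(total_gain[f] - gain[f]*weight[f]) <= 1e-6*max(1.0, abs(total_gain[f]))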
# select top by shap value
import shap
from statsmodels.api import add_constant
model.fit(x_train, y_train)
explainer=shap.TreeExplainer(model)
shap_values=explainer.shap_values(x_train)
# rank features by the sum of absolute SHAP values over the training rows
features_order=pd.DataFrame(x_train.columns, columns=['Features'])
features_order['Shap Feature Importance']=np.sum(np.abs(shap_values),axis=0)
features_order=features_order.sort_values('Shap Feature Importance', ascending=False).reset_index(drop=True)
# keep the top 300; a constant column is added in case the VIF screen below is re-enabled
x_train_left=add_constant(x_train[features_order.iloc[:300]['Features'].tolist()])
features_order1=pd.DataFrame(x_train_left.columns,columns=['Features'])
features_order1=features_order1.merge(features_order,on='Features',how='left')
# features_order1["VIF Factor"]=[variance_inflation_factor(x_train_left.values, i) for i in tqdm(range(x_train_left.shape[1]))]
features_order1=features_order1[~features_order1['Features'].isin(['const'])]
# features_order1=features_order1[features_order1["VIF Factor"] < 5].reset_index(drop=True)
# select the top 30 features by SHAP value
x_train3= x_train[features_order1.head(30)['Features'].tolist()]
x_test3= x_test[features_order1.head(30)['Features'].tolist()]
model.fit(x_train3, y_train)
print('SHAP top 30, train roc_auc_score: '+ str(roc_auc_score(y_train, model.predict_proba(x_train3).T[1])))
print('SHAP top 30, test roc_auc_score: '+ str(roc_auc_score(y_test, model.predict_proba(x_test3).T[1])))
The SHAP value is an attribution measure that is uniquely consistent and locally accurate, based on ideas from game theory and local explanations. You can find the detailed documentation at https://github.com/slundberg/shap#citations.
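For a visual sense of which features drive the predictions (a minimal sketch reusing the explainer and shap_values computed above):
# summary plot of the per-feature SHAP value distributions for the top 20 features
shap.summary_plot(shap_values, x_train, max_display=20)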
For our data set, using SHAP values and limiting the model to the same number of variables, we get almost the same classification result. SHAP-based selection does not improve the model much here, but it is always worth trying.